Abstract

The purpose of this project was to build a model that best predicts the outcome of a trial, given the available dataset. In the exploratory data analysis, I aggregated the neural activity data by averaging spike counts across neurons at each time bin. I found that these average spike counts varied across mice and sessions, differed between success and failure trials, and depended on the condition of the visual stimuli in the task (to be explained below). I also noticed that a subset of the time bins seemed more informative about a trial’s success or failure than all 40 time bins together. After integrating the data I needed, this led me to build 8 prediction models (not including the baseline naive model).

Section 1 Introduction

For this course project, the dataset that I worked with is a cleaned dataset, originating from Steinmetz, Zatka-Haas, Carandini, and Harris’s 2019 article titled Distributed coding of choice, action and engagement across the mouse brain. During the study, Neuropixels probes were used to record the neural activity in the mice’s brains. A total of 18 sessions of experiments were conducted on different days, and 4 mice participated: Cori, Forssmann, Hench, and Lederberg.

As a brief introduction, the dataset contains 5 features (contrast_left, contrast_right, time, spks, and brain_area) and 1 binary outcome variable (feedback_type) for each trial in each session. Therefore, the goal here is to build a binary classification model.

For all the trials within the same session, the spks matrices that contain the spike trains have the same dimensions. Each row of the matrix corresponds to a neuron, so the number of rows is the total number of neurons in each trial, while each column corresponds to the center of a time bin given in the time feature for that trial. Additionally, contrast_left and contrast_right are the contrasts of the left and right stimuli, respectively.

Finally, as promised above, the condition of the visual stimuli has 4 possibilities. From this point on, the group Condition 1 refers to contrast_left being greater than contrast_right. Condition 2 refers to contrast_left being less than contrast_right. Condition 3 refers to contrast_left and contrast_right both being 0. And Condition 4 refers to both contrasts being equal but not 0.
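
To make the four conditions concrete, here is a minimal Python sketch of the mapping described above. The function name stimulus_condition and the example contrast pairs are my own assumptions for illustration; they are not taken from the dataset or from the code behind this report.

```python
def stimulus_condition(contrast_left: float, contrast_right: float) -> int:
    """Map a pair of stimulus contrasts to Condition 1-4 as defined above."""
    if contrast_left > contrast_right:
        return 1  # left contrast is greater
    if contrast_left < contrast_right:
        return 2  # right contrast is greater
    if contrast_left == 0:
        return 3  # both contrasts are 0
    return 4      # contrasts are equal but not 0


# Hypothetical contrast pairs, chosen only to hit each condition once.
examples = [(1.0, 0.5), (0.0, 0.25), (0.0, 0.0), (0.5, 0.5)]
conditions = [stimulus_condition(left, right) for left, right in examples]
print(conditions)  # [1, 2, 3, 4]
```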

Section 2 Exploratory analysis

My exploratory analysis begins with creating a dataframe to summarize the information across the 18 sessions for the 4 mice, which is converted into the table shown below. Each row represents a different session. The Name of Mouse column contains the name of the mouse associated with each session. The Date of Experiment column contains the date that each experiment was conducted. The Number of Neurons column contains the total number of neurons from which the number of spikes are recorded. The Number of Trials column contains the total number of trials conducted within each session. The Success Rate column contains the mouse’s proportion of success trials (when outcome of the trial is 1) for each session. This rate is calculated by converting the original feedback_type vectors from 1 for success and -1 for failure, to 1 for success and 0 for failure, then taking the mean of those converted values. I will not be focusing on the brain areas part of the dataset, so I haven’t included it as summary information here.
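
The success rate calculation described above can be sketched in a few lines of Python. This is only an illustration under my own naming (success_rate, toy_feedback); the feedback vector is made up and does not come from any session.

```python
def success_rate(feedback_type):
    """Proportion of success trials, given feedback coded as 1 / -1."""
    # Recode -1 (failure) to 0 so that the mean equals the share of successes.
    recoded = [1 if f == 1 else 0 for f in feedback_type]
    return sum(recoded) / len(recoded)


# Hypothetical feedback vector for five trials (not taken from the dataset).
toy_feedback = [1, -1, 1, 1, -1]
toy_rate = success_rate(toy_feedback)
print(round(toy_rate, 4))  # 0.6
```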

Summary Information Across 18 Sessions for the 4 Mice
Session Number  Name of Mouse  Date of Experiment  Number of Neurons  Number of Trials  Success Rate (1)
1               Cori           2016-12-14          734                114               0.6053
2               Cori           2016-12-17          1070               251               0.6335
3               Cori           2016-12-18          619                228               0.6623
4               Forssmann      2017-11-01          1769               249               0.6667
5               Forssmann      2017-11-02          1077               254               0.6614
6               Forssmann      2017-11-04          1169               290               0.7414
7               Forssmann      2017-11-05          584                252               0.6706
8               Hench          2017-06-15          1157               250               0.6440
9               Hench          2017-06-16          788                372               0.6855
10              Hench          2017-06-17          1172               447               0.6197
11              Hench          2017-06-18          857                342               0.7953
12              Lederberg      2017-12-05          698                340               0.7382
13              Lederberg      2017-12-06          983                300               0.7967
14              Lederberg      2017-12-07          756                268               0.6940
15              Lederberg      2017-12-08          743                404               0.7649
16              Lederberg      2017-12-09          474                280               0.7179
17              Lederberg      2017-12-10          565                224               0.8304
18              Lederberg      2017-12-11          1090               216               0.8056
(1) proportion of successful trials (feedback type is 1)

I can see that the number of neurons differs across sessions, with the fewest neurons in session 16 and the most in session 4. Since the number of neurons differs, I will probably have to choose a summary statistic that accounts for that but is still representative of the neural data. The number of trials differs too. Specifically, sessions 1 and 18 have the fewest trials, but that’s because 100 trials from each were randomly removed for the test sets. Session 10 has the most trials. The success rates, which are calculated from the feedback types, differ for each session and suggest a general increase in success rate over time for all 4 mice. Additionally, the mice do not all have the same number of sessions. For instance, Cori only has 3 sessions (sessions 1-3), while Lederberg has 7 sessions (sessions 12-18), which could indicate that session number and/or mouse name will be important in the predictive model.

Since the success rate differs for all the sessions, I was interested in developing a baseline success rate for a group of sessions, particularly grouped by who the mouse was. Here are the baseline rates:

Average Success Rate for Each of the 4 Mice
Mouse Name  Success Rate (1)
Cori        0.6337
Forssmann   0.6850
Hench       0.6861
Lederberg   0.7639
(1) average of the success rates over each mouse’s sessions

Cori’s overall success rate seems to be the lowest, while Lederberg’s seems to be the highest. This reinforces that the success rates are not consistent across mice, as suspected, so I will dive deeper into the mice’s sessions to look for more common trends or patterns within the neural data itself. Based on this table, I intend for each of my predictive models to at least beat the success rate of its corresponding mouse. This means my intended model for test data from session 1 should have an accuracy higher than 0.6337, and my intended model for test data from session 18 should have an accuracy higher than 0.7639.
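
The per-mouse baselines can be reproduced approximately from the rounded session success rates in the summary table. The Python sketch below is only an illustration with my own variable names (session_rates, mouse_baseline). Because it starts from values already rounded to 4 decimals, the last digit may differ slightly from the table above.

```python
from collections import defaultdict

# (mouse, per-session success rate) pairs copied from the session summary table.
session_rates = [
    ("Cori", 0.6053), ("Cori", 0.6335), ("Cori", 0.6623),
    ("Forssmann", 0.6667), ("Forssmann", 0.6614),
    ("Forssmann", 0.7414), ("Forssmann", 0.6706),
    ("Hench", 0.6440), ("Hench", 0.6855), ("Hench", 0.6197), ("Hench", 0.7953),
    ("Lederberg", 0.7382), ("Lederberg", 0.7967), ("Lederberg", 0.6940),
    ("Lederberg", 0.7649), ("Lederberg", 0.7179), ("Lederberg", 0.8304),
    ("Lederberg", 0.8056),
]

# Group the session rates by mouse, then average within each group.
rates_by_mouse = defaultdict(list)
for mouse, rate in session_rates:
    rates_by_mouse[mouse].append(rate)

mouse_baseline = {
    mouse: sum(rates) / len(rates) for mouse, rates in rates_by_mouse.items()
}

for mouse, baseline in mouse_baseline.items():
    print(f"{mouse}: {baseline:.4f}")
```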

As hinted at in the analysis of the first table, and to dive deeper into the patterns in the mice’s sessions, I ultimately chose the average number of spikes (i.e. the average spike count) across neurons at each of the 40 time bins as the measure of neural activity to analyze for the remainder of this report. I chose the average because it accounts for the varying number of neurons in each session, so it is more comparable from one session to another than, say, the sum of the spike counts. This lets me compare more fairly across sessions.
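
As a small illustration of this summary statistic, the Python sketch below averages a spike matrix over its rows (neurons) to get one value per time bin. The function name average_spike_counts and the tiny 4-neuron, 3-bin matrix are assumptions made for the example; real trials have hundreds of neurons and 40 time bins.

```python
def average_spike_counts(spks):
    """Average spike count across neurons for each time bin.

    `spks` is a list of rows, one row per neuron and one column per time bin.
    """
    n_neurons = len(spks)
    n_bins = len(spks[0])
    return [sum(row[b] for row in spks) / n_neurons for b in range(n_bins)]


# Hypothetical trial with 4 neurons and 3 time bins.
toy_spks = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 2, 0],
]
toy_avg = average_spike_counts(toy_spks)
print(toy_avg)  # [0.5, 1.0, 0.0]
```

Dividing by the number of neurons is what keeps a session with 1769 neurons on the same scale as a session with 474.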

To continue, I wanted to explore how average spike counts change over the 40 time bins, conditional on various factors, and if certain levels of those factors could be combined. To do this, I first explored how the average spike counts changed, conditional on the different mice.

From this plot, I see a lot of variability from one time bin to another; the average spike count values keep fluctuating up and down. While the curves of best fit seem to be similarly shaped, their positions along the y-axis vary. Cori’s plot seems the most different, with the highest average spike count across the time bins. Forssmann’s and Lederberg’s plots seem to show more consistency in average spike counts, with a slight increase over time, but Lederberg’s average spike counts vary a lot more. Finally, Hench’s plot has the most noticeable increasing trend in average spike count. This suggested to me that some aspect of mouse name might be important to address when creating the predictive models. However, given the noticeable variability, the patterns across sessions rather than just across mice might hold more value for the prediction models if sessions with similar patterns could be analyzed together, which is the goal of the next few plots.

The average spike counts over time could differ not just between mice, but even between sessions for the same mouse, so I plotted the average spike counts over the 40 time bins for each of the 18 sessions to explore this.

Most of the sessions have an increasing trend, where average spike counts increase over the 40 time bins, but the average spike count values vary quite a bit, with some sessions like session 3 having higher counts and others like session 6 having lower counts. I also noticed that some sessions have more similar trends, such as sessions 1, 8, 9, and 11, as well as many of Lederberg’s sessions (as a refresher, those are sessions 12-18). Because of this variation, it seems we might be able to combine some sessions to analyze their similar trends, so it’s probably useful to keep the session number as part of the final predictive model. I would also argue that because of the clearer similarities and differences in trend among the sessions, this variable will probably be more useful to keep than mouse name. However, these plots are still not as clear as I’d like because they provide no information about how those counts vary between success and failure trials, which is the outcome we’re trying to predict.

Therefore, I decided to plot the average spike counts of the 18 sessions again but this time separate them by success and failure trials, one line of best fit for each on every smaller plot.

From this plot, I discovered a few things. One, although I wasn’t expecting this, the gap between the average spike counts for success and failure trials starts to widen at around time bin 15 in most of the sessions. Therefore, I believe that the time bins matter in determining the trial outcomes, and I can try reducing the number of time bins used in the final prediction model to only time bins 15 to 40. Two, from what I can see, there’s a noticeable difference in average spike counts between failure and success trials for most if not all of the sessions. The main pattern is that success trials mostly have a higher average spike count across time than failure trials. Three, although this main pattern is seen, I also noticed that sessions for the same mouse did not all have the same average spike count values. For example, sessions 13 and 14 are both from Lederberg but have very different values.

This means that even if trials in these sessions have similar trends, as in sessions 13 and 14, the model might still predict completely incorrect outcomes. For instance, for session 13, an average spike count of 0.03 would be considered a failure trial, but for session 14, that same value would be considered a success trial. This suggests two things to me. First, I should choose sessions that have similar trends to sessions 1 and 18 and build 2 separate predictive models, one for each, because training on similar data would be more useful for testing accuracy than training on a variety of data with different trends. Second, I will adjust the average spike counts so that similar trends will have their counts at similar values.

Since a part of the test data will come from the first session and the trends between sessions 1-3 seem comparable, I’m first looking more closely at Cori’s sessions (1-3). Continuing with splitting by success and failure trials, here’s the plot after adjusting the average spike counts to be at similar values:

This plot shows me that the failure trend lines are arguably similar, but the success trend lines may be less similar, because in session 1 there seems to be a relatively larger peak at around time bin 25, whereas the other sessions don’t peak as high. However, since these 3 sessions are from the same mouse, Cori, I will not discard their data just yet, but keep it to try fitting a model on, in case there’s no better, more accurate alternative model.

Since a part of the test data will also come from the last session (18), and the trends between sessions 12-18 seem comparable, I’m going to next take a look at Lederberg’s sessions 12-18. Continuing with splitting by success and failure trials, here’s another plot after adjusting the average spike counts to be at similar values:

From this plot, I can see that the zoomed-in trend lines for both success and failure outcomes in session 18 are quite unique, so the closest patterns I can see are the other Lederberg sessions shown above. The patterns across sessions 12-17 are still decently similar because their success trials essentially all form a steep increase in average spike counts, then start decreasing in a similar fashion. As for the failure trials, the main consensus among the trends was that they mostly still maintained a lower average spike count than success trials and the larger gap in later time bins was quite noteworthy. Also because they are all from Lederberg, they are most reasonable to use in the predictive models.

Next, to circle back to having another option for session 1’s predictive model, I’ve looked more closely at the average spike count trends across sessions (separated by success and failure), and it seems to me that sessions 8, 9, and 11 might also have similar trends to session 1. Again, I’ve continued with splitting by success and failure trials, and here’s another plot after adjusting the average spike counts to be at similar values:

From this plot, I can see that the zoomed-in trend lines for both success and failure outcomes in these four subplots are arguably more similar than before, but now the data is not just from Cori, but also from Hench. In this case, I think using another mouse’s data can be useful for getting more training data, as the trends are quite similar, indicating that it’s not just random unrelated data that I’m adding to the predictive model. The patterns across these four sessions are decently similar because their success trials essentially all form a steep increase in average spike counts, then start decreasing in a similar fashion. The failure trials are a bit flatter, while the gap between failure and success trials becomes larger as time bins increase, indicating again that we could try using only the later time bins in our predictive model, because that’s where the failure and success trials become more clearly separable.

Furthermore, I wanted to explore whether the visual stimuli carry useful information, by checking if there’s a difference in average spike counts across the 4 conditions explained in the introduction. If some conditions show similar trends, I could combine them to potentially reduce model complexity. To make the plot look less chaotic, I switched the y-axis from the average spike counts themselves to the difference between the success and failure trials’ average spike counts (meaning success minus failure average spike counts).
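
The success-minus-failure quantity on the y-axis can be sketched as follows in Python. This is an illustration only: the helper names (mean_by_bin, success_minus_failure) and the three toy trials with two time bins each are my own assumptions, and in practice the calculation would be repeated within each condition and session.

```python
def mean_by_bin(trials):
    """Average a list of per-trial vectors, bin by bin."""
    n_trials = len(trials)
    n_bins = len(trials[0])
    return [sum(t[b] for t in trials) / n_trials for b in range(n_bins)]


def success_minus_failure(avg_counts, feedback):
    """Per-bin difference: mean over success trials minus mean over failures."""
    success = [a for a, f in zip(avg_counts, feedback) if f == 1]
    failure = [a for a, f in zip(avg_counts, feedback) if f == -1]
    success_mean = mean_by_bin(success)
    failure_mean = mean_by_bin(failure)
    return [s - f for s, f in zip(success_mean, failure_mean)]


# Hypothetical average spike counts for 3 trials and 2 time bins.
toy_counts = [[0.04, 0.06], [0.02, 0.02], [0.02, 0.04]]
toy_feedback = [1, -1, 1]
toy_diff = success_minus_failure(toy_counts, toy_feedback)
print([round(d, 4) for d in toy_diff])
```

A positive value in a bin means success trials had the higher average spike count there.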

Overall, Condition 4 does not seem like it fits the pattern of the other conditions, so I will not combine that with other conditions. Condition 3 also has the same issue, so that will still be kept separate as well. However, based on these plots, it seems that oftentimes Condition 1 and 2 (the conditions where left and right stimuli contrasts are not equal) seem quite similar in terms of the line of best fit’s shape and position, so I’ll probably explore combining those conditions in my predictive models. (I noticed that sessions 1, 4, 7, 10, 13, and 16 seemed quite similar in terms of the trends of the 4 lines, but didn’t have time to explain more than that.)

To sum up this section, the main takeaways are as follows. I will try fitting predictive models for Session 1 in two ways: one using Sessions 1-3, and the other using Sessions 1, 8, 9, and 11. For the predictive model for Session 18, I will be using Sessions 12-18. I will also try combining Conditions 1 and 2 to see if model accuracy improves.

Section 3 Data integration

I created 7 main dataframes in preparation for creating the predictive models. For all of them, Feedback is the outcome variable column, with values 1 for success and -1 for failure trials. I’ve also kept all the time bins because in the predictive model it will be relatively simple to only get a subset of the time bins for testing.

This first and main dataframe contains the average spike counts for all 18 sessions, averaging the counts across the neurons in that trial at each of the 40 time bins. Each row is a trial, and I’ve kept the 4 conditions. This will help me build baseline models to compare to.

Data Integration of Average Spike Counts for All Sessions Across All 40 Time Bins (first 15 rows, first 14 columns)
Session Number (1) Trial Number (2) Feedback Condition bin1 (3) bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9 bin10
1 1 1 2 0.0490 0.0368 0.0177 0.0150 0.0327 0.0286 0.0313 0.0123 0.0341 0.0191
1 2 1 3 0.0300 0.0313 0.0341 0.0272 0.0259 0.0313 0.0218 0.0232 0.0232 0.0341
1 3 -1 2 0.0490 0.0504 0.0300 0.0436 0.0245 0.0409 0.0300 0.0381 0.0341 0.0422
1 4 -1 3 0.0559 0.0531 0.0272 0.0613 0.0572 0.0599 0.0450 0.0286 0.0395 0.0354
1 5 -1 3 0.0272 0.0436 0.0313 0.0245 0.0450 0.0381 0.0463 0.0572 0.0477 0.0163
1 6 1 3 0.0490 0.0218 0.0163 0.0109 0.0123 0.0232 0.0272 0.0327 0.0163 0.0191
1 7 1 1 0.0545 0.0368 0.0368 0.0381 0.0490 0.0341 0.0163 0.0232 0.0381 0.0191
1 8 1 1 0.0613 0.0232 0.0327 0.0259 0.0327 0.0381 0.0490 0.0545 0.0463 0.0368
1 9 1 3 0.0286 0.0218 0.0313 0.0204 0.0368 0.0245 0.0504 0.0504 0.0395 0.0232
1 10 1 1 0.0218 0.0313 0.0300 0.0327 0.0368 0.0395 0.0490 0.0327 0.0327 0.0409
1 11 1 1 0.0150 0.0218 0.0259 0.0395 0.0368 0.0286 0.0490 0.0518 0.0381 0.0409
1 12 1 2 0.0327 0.0409 0.0368 0.0300 0.0327 0.0409 0.0463 0.0490 0.0545 0.0477
1 13 -1 4 0.0368 0.0272 0.0245 0.0368 0.0286 0.0450 0.0286 0.0490 0.0341 0.0599
1 14 -1 3 0.0354 0.0504 0.0381 0.0381 0.0422 0.0490 0.0450 0.0327 0.0272 0.0300
1 15 -1 3 0.0232 0.0286 0.0559 0.0327 0.0504 0.0272 0.0450 0.0191 0.0354 0.0313
Note:
This is a subset of a dataframe with 5081 rows and 44 columns
(1) session numbers go from 1 to 18
(2) trial numbers go from 1 to the number of trials in that session
(3) time bins go from 1 to 40

The next dataframe is in preparation for Session 1’s predictive model. It is the same as the dataframe above except instead of all 18 sessions, I have only included sessions 1-3, as concluded in the main takeaways of the exploratory data analysis. There are 4 conditions.

Data Integration of Average Spike Counts for Sessions 1-3 (Cori’s Sessions) Across All 40 Time Bins (first 15 rows, first 14 columns)
Session Number (1) Trial Number (2) Feedback Condition bin1 (3) bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9 bin10
1 1 1 2 0.0490 0.0368 0.0177 0.0150 0.0327 0.0286 0.0313 0.0123 0.0341 0.0191
1 2 1 3 0.0300 0.0313 0.0341 0.0272 0.0259 0.0313 0.0218 0.0232 0.0232 0.0341
1 3 -1 2 0.0490 0.0504 0.0300 0.0436 0.0245 0.0409 0.0300 0.0381 0.0341 0.0422
1 4 -1 3 0.0559 0.0531 0.0272 0.0613 0.0572 0.0599 0.0450 0.0286 0.0395 0.0354
1 5 -1 3 0.0272 0.0436 0.0313 0.0245 0.0450 0.0381 0.0463 0.0572 0.0477 0.0163
1 6 1 3 0.0490 0.0218 0.0163 0.0109 0.0123 0.0232 0.0272 0.0327 0.0163 0.0191
1 7 1 1 0.0545 0.0368 0.0368 0.0381 0.0490 0.0341 0.0163 0.0232 0.0381 0.0191
1 8 1 1 0.0613 0.0232 0.0327 0.0259 0.0327 0.0381 0.0490 0.0545 0.0463 0.0368
1 9 1 3 0.0286 0.0218 0.0313 0.0204 0.0368 0.0245 0.0504 0.0504 0.0395 0.0232
1 10 1 1 0.0218 0.0313 0.0300 0.0327 0.0368 0.0395 0.0490 0.0327 0.0327 0.0409
1 11 1 1 0.0150 0.0218 0.0259 0.0395 0.0368 0.0286 0.0490 0.0518 0.0381 0.0409
1 12 1 2 0.0327 0.0409 0.0368 0.0300 0.0327 0.0409 0.0463 0.0490 0.0545 0.0477
1 13 -1 4 0.0368 0.0272 0.0245 0.0368 0.0286 0.0450 0.0286 0.0490 0.0341 0.0599
1 14 -1 3 0.0354 0.0504 0.0381 0.0381 0.0422 0.0490 0.0450 0.0327 0.0272 0.0300
1 15 -1 3 0.0232 0.0286 0.0559 0.0327 0.0504 0.0272 0.0450 0.0191 0.0354 0.0313
Note:
This is a subset of a dataframe with 593 rows and 44 columns
(1) session numbers go from 1 to 3
(2) trial numbers go from 1 to the number of trials in that session
(3) time bins go from 1 to 40

The third dataframe here is the same as the one above, except now I’ve combined conditions 1 and 2 into the new condition 1, for further testing in the predictive modeling phase. This also means the original condition 3 has become the new condition 2, and the original condition 4 has become the new condition 3.

Same Data Integration as Table Above, Except Conditions 1 and 2 are Combined into One Condition
Session Number (1) Trial Number (2) Feedback Condition bin1 (3) bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9 bin10
1 1 1 1 0.0490 0.0368 0.0177 0.0150 0.0327 0.0286 0.0313 0.0123 0.0341 0.0191
1 2 1 2 0.0300 0.0313 0.0341 0.0272 0.0259 0.0313 0.0218 0.0232 0.0232 0.0341
1 3 -1 1 0.0490 0.0504 0.0300 0.0436 0.0245 0.0409 0.0300 0.0381 0.0341 0.0422
1 4 -1 2 0.0559 0.0531 0.0272 0.0613 0.0572 0.0599 0.0450 0.0286 0.0395 0.0354
1 5 -1 2 0.0272 0.0436 0.0313 0.0245 0.0450 0.0381 0.0463 0.0572 0.0477 0.0163
1 6 1 2 0.0490 0.0218 0.0163 0.0109 0.0123 0.0232 0.0272 0.0327 0.0163 0.0191
1 7 1 1 0.0545 0.0368 0.0368 0.0381 0.0490 0.0341 0.0163 0.0232 0.0381 0.0191
1 8 1 1 0.0613 0.0232 0.0327 0.0259 0.0327 0.0381 0.0490 0.0545 0.0463 0.0368
1 9 1 2 0.0286 0.0218 0.0313 0.0204 0.0368 0.0245 0.0504 0.0504 0.0395 0.0232
1 10 1 1 0.0218 0.0313 0.0300 0.0327 0.0368 0.0395 0.0490 0.0327 0.0327 0.0409
1 11 1 1 0.0150 0.0218 0.0259 0.0395 0.0368 0.0286 0.0490 0.0518 0.0381 0.0409
1 12 1 1 0.0327 0.0409 0.0368 0.0300 0.0327 0.0409 0.0463 0.0490 0.0545 0.0477
1 13 -1 3 0.0368 0.0272 0.0245 0.0368 0.0286 0.0450 0.0286 0.0490 0.0341 0.0599
1 14 -1 2 0.0354 0.0504 0.0381 0.0381 0.0422 0.0490 0.0450 0.0327 0.0272 0.0300
1 15 -1 2 0.0232 0.0286 0.0559 0.0327 0.0504 0.0272 0.0450 0.0191 0.0354 0.0313
Note:
This is a subset of a dataframe with 593 rows and 44 columns
(1) session numbers go from 1 to 3
(2) trial numbers go from 1 to the number of trials in that session
(3) time bins go from 1 to 40
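
The recoding used for the combined-condition dataframes can be sketched as a simple lookup in Python. The names COMBINED_CONDITION and combine_conditions are hypothetical; the condition values are the Condition column of the first 15 Session 1 trials shown in the tables above.

```python
# Original condition -> new condition after merging Conditions 1 and 2.
COMBINED_CONDITION = {1: 1, 2: 1, 3: 2, 4: 3}


def combine_conditions(conditions):
    """Recode a list of original conditions (1-4) into combined ones (1-3)."""
    return [COMBINED_CONDITION[c] for c in conditions]


# Condition column of the first 15 Session 1 trials, before combining.
original = [2, 3, 2, 3, 3, 3, 1, 1, 3, 1, 1, 2, 4, 3, 3]
combined = combine_conditions(original)
print(combined)  # [1, 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 3, 2, 2]
```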

The fourth dataframe is in preparation for Session 18’s predictive model. It is the same as the main dataframe above except instead of all 18 sessions, I have only included sessions 12-18, as concluded in the main takeaways of the exploratory data analysis. There are four conditions.

Data Integration of Average Spike Counts for Sessions 12-18 (Lederberg’s Sessions) Across All 40 Time Bins (first 15 rows, first 14 columns)
Session Number (1) Trial Number (2) Feedback Condition bin1 (3) bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9 bin10
12 1 1 3 0.0179 0.0222 0.0265 0.0294 0.0165 0.0194 0.0337 0.0222 0.0222 0.0380
12 2 -1 3 0.0623 0.0623 0.0566 0.0623 0.0537 0.0451 0.0509 0.0351 0.0451 0.0509
12 3 1 3 0.0394 0.0351 0.0337 0.0308 0.0294 0.0337 0.0280 0.0308 0.0308 0.0294
12 4 1 1 0.0394 0.0365 0.0408 0.0380 0.0308 0.0437 0.0408 0.0466 0.0294 0.0494
12 5 1 2 0.0380 0.0408 0.0365 0.0251 0.0265 0.0265 0.0280 0.0251 0.0280 0.0494
12 6 1 4 0.0294 0.0237 0.0394 0.0351 0.0237 0.0337 0.0280 0.0337 0.0480 0.0666
12 7 1 2 0.0208 0.0265 0.0136 0.0337 0.0237 0.0337 0.0480 0.0408 0.0394 0.0466
12 8 1 2 0.0179 0.0351 0.0337 0.0351 0.0208 0.0322 0.0451 0.0595 0.0394 0.0408
12 9 1 1 0.0294 0.0380 0.0408 0.0437 0.0494 0.0251 0.0480 0.0423 0.0480 0.0380
12 10 1 2 0.0308 0.0351 0.0308 0.0136 0.0165 0.0308 0.0237 0.0480 0.0337 0.0351
12 11 1 2 0.0280 0.0380 0.0351 0.0394 0.0165 0.0251 0.0294 0.0337 0.0322 0.0494
12 12 1 2 0.0537 0.0494 0.0466 0.0380 0.0308 0.0437 0.0509 0.0595 0.0752 0.0451
12 13 1 1 0.0308 0.0208 0.0337 0.0322 0.0451 0.0365 0.0208 0.0237 0.0351 0.0509
12 14 1 3 0.0337 0.0322 0.0394 0.0251 0.0280 0.0251 0.0337 0.0280 0.0222 0.0194
12 15 -1 3 0.0294 0.0423 0.0523 0.0280 0.0337 0.0394 0.0337 0.0351 0.0394 0.0437
Note:
This is a subset of a dataframe with 2032 rows and 44 columns
(1) session numbers go from 12 to 18
(2) trial numbers go from 1 to the number of trials in that session
(3) time bins go from 1 to 40

The fifth dataframe here is the same as the fourth one, except now I’ve again combined conditions 1 and 2 into the new condition 1, for further testing in the predictive modeling phase. This also means the original condition 3 has become the new condition 2, and the original condition 4 has become the new condition 3.

Same Data Integration as Table Above, Except Conditions 1 and 2 are Combined into One Condition
Session Number (1) Trial Number (2) Feedback Condition bin1 (3) bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9 bin10
12 1 1 2 0.0179 0.0222 0.0265 0.0294 0.0165 0.0194 0.0337 0.0222 0.0222 0.0380
12 2 -1 2 0.0623 0.0623 0.0566 0.0623 0.0537 0.0451 0.0509 0.0351 0.0451 0.0509
12 3 1 2 0.0394 0.0351 0.0337 0.0308 0.0294 0.0337 0.0280 0.0308 0.0308 0.0294
12 4 1 1 0.0394 0.0365 0.0408 0.0380 0.0308 0.0437 0.0408 0.0466 0.0294 0.0494
12 5 1 1 0.0380 0.0408 0.0365 0.0251 0.0265 0.0265 0.0280 0.0251 0.0280 0.0494
12 6 1 3 0.0294 0.0237 0.0394 0.0351 0.0237 0.0337 0.0280 0.0337 0.0480 0.0666
12 7 1 1 0.0208 0.0265 0.0136 0.0337 0.0237 0.0337 0.0480 0.0408 0.0394 0.0466
12 8 1 1 0.0179 0.0351 0.0337 0.0351 0.0208 0.0322 0.0451 0.0595 0.0394 0.0408
12 9 1 1 0.0294 0.0380 0.0408 0.0437 0.0494 0.0251 0.0480 0.0423 0.0480 0.0380
12 10 1 1 0.0308 0.0351 0.0308 0.0136 0.0165 0.0308 0.0237 0.0480 0.0337 0.0351
12 11 1 1 0.0280 0.0380 0.0351 0.0394 0.0165 0.0251 0.0294 0.0337 0.0322 0.0494
12 12 1 1 0.0537 0.0494 0.0466 0.0380 0.0308 0.0437 0.0509 0.0595 0.0752 0.0451
12 13 1 1 0.0308 0.0208 0.0337 0.0322 0.0451 0.0365 0.0208 0.0237 0.0351 0.0509
12 14 1 2 0.0337 0.0322 0.0394 0.0251 0.0280 0.0251 0.0337 0.0280 0.0222 0.0194
12 15 -1 2 0.0294 0.0423 0.0523 0.0280 0.0337 0.0394 0.0337 0.0351 0.0394 0.0437
Note:
This is a subset of a dataframe with 2032 rows and 44 columns
(1) session numbers go from 12 to 18
(2) trial numbers go from 1 to the number of trials in that session
(3) time bins go from 1 to 40

The sixth dataframe is in preparation for Session 1’s alternative predictive model. It is the same as the main dataframe above except instead of all 18 sessions, I have only included sessions 1,8,9,11, as concluded in the main takeaways of the exploratory data analysis. There are four conditions.

Data Integration of Average Spike Counts for Sessions 1,8,9,11 Across All 40 Time Bins (first 15 rows, first 14 columns)
Session Number (1) Trial Number (2) Feedback Condition bin1 (3) bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9 bin10
1 1 1 2 0.0490 0.0368 0.0177 0.0150 0.0327 0.0286 0.0313 0.0123 0.0341 0.0191
1 2 1 3 0.0300 0.0313 0.0341 0.0272 0.0259 0.0313 0.0218 0.0232 0.0232 0.0341
1 3 -1 2 0.0490 0.0504 0.0300 0.0436 0.0245 0.0409 0.0300 0.0381 0.0341 0.0422
1 4 -1 3 0.0559 0.0531 0.0272 0.0613 0.0572 0.0599 0.0450 0.0286 0.0395 0.0354
1 5 -1 3 0.0272 0.0436 0.0313 0.0245 0.0450 0.0381 0.0463 0.0572 0.0477 0.0163
1 6 1 3 0.0490 0.0218 0.0163 0.0109 0.0123 0.0232 0.0272 0.0327 0.0163 0.0191
1 7 1 1 0.0545 0.0368 0.0368 0.0381 0.0490 0.0341 0.0163 0.0232 0.0381 0.0191
1 8 1 1 0.0613 0.0232 0.0327 0.0259 0.0327 0.0381 0.0490 0.0545 0.0463 0.0368
1 9 1 3 0.0286 0.0218 0.0313 0.0204 0.0368 0.0245 0.0504 0.0504 0.0395 0.0232
1 10 1 1 0.0218 0.0313 0.0300 0.0327 0.0368 0.0395 0.0490 0.0327 0.0327 0.0409
1 11 1 1 0.0150 0.0218 0.0259 0.0395 0.0368 0.0286 0.0490 0.0518 0.0381 0.0409
1 12 1 2 0.0327 0.0409 0.0368 0.0300 0.0327 0.0409 0.0463 0.0490 0.0545 0.0477
1 13 -1 4 0.0368 0.0272 0.0245 0.0368 0.0286 0.0450 0.0286 0.0490 0.0341 0.0599
1 14 -1 3 0.0354 0.0504 0.0381 0.0381 0.0422 0.0490 0.0450 0.0327 0.0272 0.0300
1 15 -1 3 0.0232 0.0286 0.0559 0.0327 0.0504 0.0272 0.0450 0.0191 0.0354 0.0313
Note:
This is a subset of a dataframe with 1078 rows and 44 columns
(1) session numbers are 1, 8, 9, 11
(2) trial numbers go from 1 to the number of trials in that session
(3) time bins go from 1 to 40

Finally, the seventh dataframe here is the same as the sixth one, except now I’ve again combined conditions 1 and 2 into the new condition 1, for further testing in the predictive modeling phase. This also means the original condition 3 has become the new condition 2, and the original condition 4 has become the new condition 3.

Same Data Integration as Table Above, Except Conditions 1 and 2 are Combined into One Condition
Session Number (1) Trial Number (2) Feedback Condition bin1 (3) bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9 bin10
1 1 1 1 0.0490 0.0368 0.0177 0.0150 0.0327 0.0286 0.0313 0.0123 0.0341 0.0191
1 2 1 2 0.0300 0.0313 0.0341 0.0272 0.0259 0.0313 0.0218 0.0232 0.0232 0.0341
1 3 -1 1 0.0490 0.0504 0.0300 0.0436 0.0245 0.0409 0.0300 0.0381 0.0341 0.0422
1 4 -1 2 0.0559 0.0531 0.0272 0.0613 0.0572 0.0599 0.0450 0.0286 0.0395 0.0354
1 5 -1 2 0.0272 0.0436 0.0313 0.0245 0.0450 0.0381 0.0463 0.0572 0.0477 0.0163
1 6 1 2 0.0490 0.0218 0.0163 0.0109 0.0123 0.0232 0.0272 0.0327 0.0163 0.0191
1 7 1 1 0.0545 0.0368 0.0368 0.0381 0.0490 0.0341 0.0163 0.0232 0.0381 0.0191
1 8 1 1 0.0613 0.0232 0.0327 0.0259 0.0327 0.0381 0.0490 0.0545 0.0463 0.0368
1 9 1 2 0.0286 0.0218 0.0313 0.0204 0.0368 0.0245 0.0504 0.0504 0.0395 0.0232
1 10 1 1 0.0218 0.0313 0.0300 0.0327 0.0368 0.0395 0.0490 0.0327 0.0327 0.0409
1 11 1 1 0.0150 0.0218 0.0259 0.0395 0.0368 0.0286 0.0490 0.0518 0.0381 0.0409
1 12 1 1 0.0327 0.0409 0.0368 0.0300 0.0327 0.0409 0.0463 0.0490 0.0545 0.0477
1 13 -1 3 0.0368 0.0272 0.0245 0.0368 0.0286 0.0450 0.0286 0.0490 0.0341 0.0599
1 14 -1 2 0.0354 0.0504 0.0381 0.0381 0.0422 0.0490 0.0450 0.0327 0.0272 0.0300
1 15 -1 2 0.0232 0.0286 0.0559 0.0327 0.0504 0.0272 0.0450 0.0191 0.0354 0.0313
Note:
This is a subset of a dataframe with 1078 rows and 44 columns
(1) session numbers are 1, 8, 9, 11
(2) trial numbers go from 1 to the number of trials in that session
(3) time bins go from 1 to 40

I also created matrices from these dataframes in order to feed into the predictive models below. In those models, I will also be able to use a subset of the time bin columns.
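
Selecting a subset of the time bin columns, such as bins 15-40, can be sketched like this in Python. The function select_time_bins and the placeholder rows are assumptions for illustration. Only the column naming (bin1 to bin40) follows the tables above.

```python
def select_time_bins(rows, first_bin=15, last_bin=40):
    """Return one feature vector per trial, keeping bins first_bin..last_bin."""
    columns = [f"bin{i}" for i in range(first_bin, last_bin + 1)]
    return [[row[c] for c in columns] for row in rows]


# Hypothetical trial rows with placeholder values for bin1..bin40.
toy_rows = [{f"bin{i}": i / 1000 for i in range(1, 41)} for _ in range(2)]
features = select_time_bins(toy_rows)
print(len(features), len(features[0]))  # 2 26
```

Keeping all 40 bins in the dataframes and slicing at modeling time means the same dataframe can feed both the all-bins models and the bins 15-40 models.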

Section 4 Predictive modeling

I created a baseline naive model to make sure my models are all at least better than that. Then, I built 8 other models to compare against each other in order to find the two best predictive models for predicting Session 1 and Session 18 test data. For each of models 3-8, I tried fitting it once with the first two conditions combined and once with them kept separate, then picked the version that produced the better results. In those models, I only used time bins 15-40.

I’m using Accuracy as a simple metric to gauge how well the model is performing on unseen data because it measures what proportion of all the predictions made were correct (true positives and true negatives, or in our case true successes and true failures). Therefore, we want to aim for higher accuracy. However, since I mainly care about success outcomes in this project, I also want to take into account the false positives, which Accuracy does not report separately. Since Accuracy may also be insufficient in dealing with imbalanced data (in our case, we can see from the original success rates that there are more 1’s than -1’s for the outcome variable), I’m also going to evaluate the area under the ROC curve.
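
To show why Accuracy alone can look good on imbalanced data, here is a minimal Python sketch that scores a naive "always predict success" rule. The accuracy function and the toy labels are my own assumptions for illustration and are not the project’s test data.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical outcomes: 7 successes (1) and 3 failures (-1).
toy_labels = [1, 1, -1, 1, -1, 1, 1, -1, 1, 1]

# Naive rule: predict success for every trial.
naive_pred = [1] * len(toy_labels)
naive_acc = accuracy(toy_labels, naive_pred)
print(naive_acc)  # 0.7
```

The naive rule scores exactly the success rate of the data, without ever identifying a failure trial.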

ROC stands for receiver operating characteristic, and the area under the ROC curve (AUC) is useful as a performance metric for binary classification. The ROC curve itself shows the true positive rate versus the false positive rate as the classification threshold changes. Perhaps more useful is the AUC, which measures how well the model is able to distinguish success and failure outcomes across all classification thresholds. Because it is classification-threshold-invariant, it is a more stable metric than Accuracy. Again, we want to aim for a higher AUC.
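
One way to see what the AUC measures is its pairwise interpretation: it equals the probability that a randomly chosen success trial receives a higher score than a randomly chosen failure trial, with ties counted as one half. The Python sketch below computes it that way. It is an illustration with made-up labels and scores, not the routine used to produce the numbers in this report.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (success, failure) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == -1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # success trial scored higher: correct ordering
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores.
toy_labels = [1, -1, 1, -1]
toy_scores = [0.9, 0.4, 0.6, 0.7]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 0.75

# A constant score (e.g. always predicting success) cannot rank anything.
constant_auc = roc_auc(toy_labels, [1.0] * len(toy_labels))
print(constant_auc)  # 0.5
```

The second call shows why a model that predicts the same outcome for every trial has an AUC of 0.5 no matter how high its Accuracy is.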

Since our goal here is binary classification, the models I chose to use are logistic regression and XGBoost.

I used Logistic Regression because it is a simple model for predicting binary outcomes in supervised machine learning, which is what we have with the target feedback variable of success (1) and failure (-1). It models the probability that the target variable falls in the success category versus the failure category. When building the models below, I adjusted the cut-off threshold to try to get the highest accuracy model within the constraints of my data integration.
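
Adjusting the cut-off threshold can be sketched as a simple grid search over candidate thresholds applied to predicted success probabilities. In the Python sketch below, the function best_threshold, the candidate grid, and the toy probabilities are all assumptions for illustration. It also assumes the threshold is tuned on held-out validation data, since tuning it on the same data used for the final evaluation would give an optimistic accuracy.

```python
def best_threshold(y_true, probs, thresholds):
    """Return the (threshold, accuracy) pair with the highest accuracy."""
    best = None
    for thr in thresholds:
        # Predict success (1) when the probability reaches the threshold.
        preds = [1 if p >= thr else -1 for p in probs]
        acc = sum(t == p for t, p in zip(y_true, preds)) / len(y_true)
        if best is None or acc > best[1]:
            best = (thr, acc)
    return best


# Hypothetical validation labels and predicted success probabilities.
toy_labels = [1, 1, -1, 1, -1, 1]
toy_probs = [0.8, 0.55, 0.5, 0.45, 0.3, 0.9]
grid = [0.3, 0.4, 0.5, 0.6, 0.7]

chosen_thr, chosen_acc = best_threshold(toy_labels, toy_probs, grid)
print(chosen_thr, round(chosen_acc, 4))
```

With these toy values the default cut-off of 0.5 is not the best choice, which is the situation that motivates adjusting it.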

I also used XGBoost because it can train a binary classification model and copes well with multicollinearity and nonlinear relationships among predictor variables. Its downside is that it has many hyperparameters to tune, which adds complexity. I still think it is useful for the data I’ve kept: at the very least my time bin variables are probably correlated with each other, I may be able to keep only the most important ones (e.g. time bins 15-40), and nonlinear relationships are handled automatically so I don’t have to worry about them. When building the models below, I adjusted the nrounds and eta hyperparameters, and eventually the cut-off threshold, to try to get the highest accuracy within the constraints of my data integration.


Baseline Naive Model: Predicting all 1’s for the outcome (predicting all successes)

Accuracy: 0.7103
Area under the ROC Curve: 0.5


Model 1: Logistic Regression using all 18 Sessions, all Trials, all 40 Time Bins, all 4 Conditions

Accuracy: 0.7182
Area under the ROC Curve: 0.6962


Model 2: XGBoost using all 18 Sessions, all Trials, all 40 Time Bins, all 4 Conditions

Accuracy: 0.736
Area under the ROC Curve: 0.7029


Model 3: XGBoost using Sessions 1-3, all their Trials, Time Bins 15-40, Combined Conditions 1&2

Accuracy: 0.7179
Area under the ROC Curve: 0.7102


Model 4: XGBoost using Sessions 1,8,9,11, all their Trials, Time Bins 15-40, all 4 Conditions

Accuracy: 0.7395
Area under the ROC Curve: 0.7336


Model 5: XGBoost using Sessions 12-18, all their Trials, Time Bins 15-40, Combined Conditions 1&2

Accuracy: 0.7857
Area under the ROC Curve: 0.7733


Model 6: Logistic Regression using Sessions 1-3, all their Trials, Time Bins 15-40, Combined Conditions 1&2

Accuracy: 0.7179
Area under the ROC Curve: 0.7279


Model 7: Logistic Regression using Sessions 1,8,9,11, all their Trials, Time Bins 15-40, Combined Conditions 1&2

Accuracy: 0.7814
Area under the ROC Curve: 0.7427


Model 8: Logistic Regression using Sessions 12-18, all their Trials, Time Bins 15-40, all 4 Conditions

Accuracy: 0.7586
Area under the ROC Curve: 0.7311


I’ve also created a plot of the ROC curves for the Naive Model and the two models I chose as my final predictive models, Model 7 (logistic regression) for Session 1 and Model 5 (XGBoost) for Session 18, based on highest accuracy and AUC.

From this plot, Model 5 (red curve) appears to have the highest AUC because its curve lies farthest from the straight blue line of the baseline model, so its ROC curve is slightly better than Model 7’s (green curve). Model 7’s curve is still good, as it is noticeably far from the naive model’s line. This suggests that both models perform better than the naive model, which was my original goal. The accuracy values confirm this: Model 5’s is the best at 0.7857, Model 7’s is second best at 0.7814, and both are better than the Naive Model’s 0.7103. The AUC values agree: Model 5’s is the best at 0.7733, Model 7’s is second best at 0.7427, and both are much better than the Naive Model’s 0.5.

Section 5 Prediction performance on the test sets

I will continue using Accuracy and AUC, where higher values indicate better performance, to evaluate the models on the test data from Sessions 1 and 18. Both metrics show how well the models predict on unseen data, with success treated as the positive outcome.


Prediction Performance of Test Data from Session 1

The plot I’ve included here is to check that this test data follows patterns similar to those of Sessions 1, 8, 9, and 11, and that its average spike counts fall in the same range, so that the model is less likely to predict incorrectly (for the purpose of the plot, I’ve labeled the test1 data “session 19”).

From the plot, I can confirm that both statements are true to a fair extent. Then I proceeded to integrate the data in the same way as above, and evaluate the performance using the same metrics, Accuracy and AUC.

Accuracy: 0.77
Area under the ROC Curve: 0.7599

From these results, I conclude that the accuracy is lower than Model 7’s original 0.7814, while the AUC is slightly higher than its original 0.7427, which could be due to natural variation in the data. Overall, the predictive performance is on par with the performance on the original test subset, so the trained logistic regression model still seems appropriate for unseen data.


Prediction Performance of Test Data from Session 18

The plot I’ve included here is to check that this test data follows patterns similar to those of Sessions 12-18, and that its average spike counts fall in the same range, so that the model is less likely to predict incorrectly (for the purpose of the plot, I’ve labeled the test2 data “session 20”). When I first plotted it, session 20’s average spike counts were too low, so I increased them so that the values across these sessions are now comparable for the model.
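The text above does not say how the spike counts were increased. Purely as an illustration of one possible approach, and not necessarily the adjustment used here, the sketch below multiplies a session’s average spike counts by a constant so that their mean matches a reference mean. The function name and the numbers are hypothetical.

```python
# Hypothetical helper: rescale values so their mean matches a reference mean.
# This is an assumed approach, not the adjustment documented in this report.

def rescale_to_reference(values, reference_mean):
    """Multiply every value by reference_mean / mean(values)."""
    current_mean = sum(values) / len(values)
    if current_mean == 0:
        raise ValueError("cannot rescale values whose mean is zero")
    factor = reference_mean / current_mean
    return [v * factor for v in values]


# Toy numbers: test-session averages that sit below the training range.
test_means = [1.0, 2.0, 3.0]  # mean is 2.0
training_mean = 4.0           # assumed mean over the training sessions

adjusted = rescale_to_reference(test_means, training_mean)  # [2.0, 4.0, 6.0]
```

A multiplicative rescaling keeps the relative pattern across time bins intact while moving the values into the range the model was trained on.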

From the plot, I can confirm that both statements are now true. Then I proceeded to integrate the data in the same way as before, and evaluate the performance using the same metrics, Accuracy and AUC.

Accuracy: 0.75
Area under the ROC Curve: 0.6809

From these results, I noticed that the accuracy was a bit lower than Model 5’s original 0.7857 and that the AUC was considerably lower than its original 0.7733, which could suggest that the original model was overfitting. Overall, the predictive performance of Model 5 on unseen data is less satisfactory.

Both models’ predictive performances on unseen data were still better than the Naive model, in terms of both accuracy and AUC.
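The size of each change can be read directly from the numbers reported above. The short sketch below subtracts each model’s original metrics from its test-set metrics; only the dictionary layout and variable names are assumptions, and the values are the ones listed in this report.

```python
# Metrics as reported in this report; the data layout is an assumption.
original = {
    "Model 7 (Session 1)": {"accuracy": 0.7814, "auc": 0.7427},
    "Model 5 (Session 18)": {"accuracy": 0.7857, "auc": 0.7733},
}
test = {
    "Model 7 (Session 1)": {"accuracy": 0.77, "auc": 0.7599},
    "Model 5 (Session 18)": {"accuracy": 0.75, "auc": 0.6809},
}

# Change from the original evaluation to the test set (negative means a drop).
change = {
    model: {
        metric: round(test[model][metric] - original[model][metric], 4)
        for metric in original[model]
    }
    for model in original
}

for model, deltas in change.items():
    print(model, deltas)
```

Model 7 loses about 0.011 in accuracy and gains about 0.017 in AUC, while Model 5 loses about 0.036 in accuracy and about 0.092 in AUC, so the drop is concentrated in the XGBoost model.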

Section 6 Discussion

In conclusion, my initial exploration of the data led me to focus the neural analysis mainly on average spike counts, time bins, and visual stimuli conditions. To build the binary classification models, I chose to try logistic regression and XGBoost, two model types that each have their pros and cons. Based on my exploratory data analysis and data integration process, I built the best two final models I could within the time allotted: a logistic regression model using Sessions 1, 8, 9, and 11 as Session 1’s predictive model, and an XGBoost model using Sessions 12-18 as Session 18’s predictive model. They were chosen by highest accuracy and AUC. Both achieved accuracy and AUC values above the Naive model’s, in the training phase and on the test data, which was good. However, accuracy decreased for both models on unseen data, and AUC decreased considerably for the XGBoost model. Some decrease was expected, but the results could likely be improved, especially for the XGBoost model. There are therefore some limitations to my analysis, along with potential improvements I could explore in the future, which I discuss below.

One limitation is that I didn’t have time to explore the brain areas, which may have given more insight into which neurons and/or trials could be combined to improve predictive performance. In the future I could also spend more time tuning the hyperparameters, especially those of my XGBoost model, to try to achieve a higher AUC on unseen data.

Acknowledgements

Professor and TA notes

https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc#:~:text=An%20ROC%20curve%20(receiver%20operating,True%20Positive%20Rate

https://www.linkedin.com/advice/0/how-can-you-use-accuracy-evaluation-metric-skills-machine-learning